Abstract
Video prediction is a promising task in computer vision with many real-world applications and is worth exploring. Most existing methods generate new frames from appearance features with few constraints, which results in blurry predictions. Recently, several motion-focused methods have been proposed to alleviate this problem. However, owing to the variety and complexity of real-world motions, it is difficult to capture object motions from a video sequence and apply the learned motions to appearance. In this paper, an adaptive hierarchical motion-focused model is introduced to predict realistic future frames. The model combines hierarchical motion modeling with an adaptive transformation strategy, which achieves better motion understanding and application. We train the model end to end and employ adversarial training to improve the quality of the generated frames. Experiments on two challenging datasets, Penn Action and UCF101, demonstrate that the proposed model is effective and competitive with state-of-the-art approaches.
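To make the motion-focused idea concrete: rather than synthesizing pixels directly from appearance features, such methods typically predict a transformation (for example, a dense displacement field) and apply it to a past frame. The sketch below is a generic, minimal NumPy illustration of this transformation-applying step, assuming backward warping with bilinear sampling; the function name `warp_frame` and the flow convention are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def warp_frame(frame, flow):
    """Backward-warp a grayscale frame by a dense flow field.

    frame: (H, W) array; flow: (H, W, 2) per-pixel displacement (dy, dx)
    telling each output pixel where to sample in the source frame.
    Sampling is bilinear, with coordinates clipped at the image border.
    """
    H, W = frame.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Source coordinates for every output pixel.
    sy = np.clip(ys + flow[..., 0], 0, H - 1)
    sx = np.clip(xs + flow[..., 1], 0, W - 1)
    # Integer corners and fractional weights for bilinear interpolation.
    y0, x0 = np.floor(sy).astype(int), np.floor(sx).astype(int)
    y1, x1 = np.clip(y0 + 1, 0, H - 1), np.clip(x0 + 1, 0, W - 1)
    wy, wx = sy - y0, sx - x0
    top = frame[y0, x0] * (1 - wx) + frame[y0, x1] * wx
    bot = frame[y1, x0] * (1 - wx) + frame[y1, x1] * wx
    return top * (1 - wy) + bot * wy

# Usage: shift content one pixel to the right (output(x) samples input(x-1)).
frame = np.zeros((4, 4))
frame[1, 1] = 1.0
flow = np.zeros((4, 4, 2))
flow[..., 1] = -1.0
predicted = warp_frame(frame, flow)
```

Because the output is assembled from real source pixels rather than regressed from scratch, warping-based prediction tends to preserve sharp textures, which is the usual motivation for motion-focused models over direct pixel synthesis.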
Acknowledgement
This work is supported by Shenzhen Peacock Plan (20130408-183003656), Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (ZDSYS201703031405467), and National Natural Science Foundation of China (NSFC, No.U1613209).
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Tang, M., Wang, W., Chen, X., He, Y. (2018). Adaptive Hierarchical Motion-Focused Model for Video Prediction. In: Hong, R., Cheng, W.H., Yamasaki, T., Wang, M., Ngo, C.W. (eds) Advances in Multimedia Information Processing – PCM 2018. Lecture Notes in Computer Science, vol 11164. Springer, Cham. https://doi.org/10.1007/978-3-030-00776-8_53
DOI: https://doi.org/10.1007/978-3-030-00776-8_53
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00775-1
Online ISBN: 978-3-030-00776-8
eBook Packages: Computer Science, Computer Science (R0)